1 Abstract

In this project, we developed a predictive model for the US Health Insurance Marketplace. We examined the transitions between 2014, 2015, and 2016, before and after the Affordable Care Act’s coverage-related provisions took effect in 2014. Close to 22.8 million people gained insurance coverage after the act, while 5.9 million people lost their coverage, for a net increase of 16.9 million people with insurance. Through in-depth research, trend analysis, and exploratory analysis of the data released by the Centers for Medicare & Medicaid Services (CMS), comprising over twelve million records, we found a significant hike in the Maximum Out of Pocket (MOOP) costs that families bear, after which the issuer pays the rest. Considering the varying political ideologies across the country, we found that monthly premium rates are significantly lower in states with higher liberal scores than in conservative states. We also developed classification models using the C5.0 and rpart algorithms and evaluated their accuracies on held-out test sets. Finally, we trained a neural network model on the 2014-2016 data to predict plan rates for the year 2017.

2 Introduction and Background

Health insurance is a term commonly used to describe any program that helps pay medical expenses. A number of forms of health insurance exist in the United States; the main forms include varieties of private and public coverage. Insurance companies, or insurers, in the marketplace directly underwrite insurance policies relating to life, health, accident, and medical risks, and determine the premiums assigned to each plan. It is interesting to note that life and health insurers generate revenue not only through the specific activity of insurance underwriting but also by investing premiums and annuity contracts. By pooling many risks, these insurers can provide protection at a fraction of the potential loss.

According to available statistics, the number of Americans with private health insurance began to fall in the late 1990s and early 2000s, until the comprehensive health care reform law enacted in March 2010, commonly referred to as the Affordable Care Act (ACA) or “Obamacare”. Its primary goals were to make affordable health insurance available to more people, expand the Medicaid program to cover all adults with income below 138% of the federal poverty level, and support innovative medical care delivery methods designed to lower the cost of health care generally. The number of uninsured people fell significantly after the ACA was signed in 2010 but leveled off in 2016, which raises significant questions about the changes in the rules over the last three to four years. Therefore, we aim to investigate some of these questions by digging deep into the dataset provided by the ‘Centers for Medicare & Medicaid Services (CMS)’.

How good a healthcare plan really is can be difficult to assess, but one very crude benchmark is the maximum out of pocket (MOOP): in layman’s terms, the maximum a subscriber has to pay if the absolute worst happens. Moreover, ‘healthcare.gov’ notes that not all states have expanded their Medicaid programs under the ACA. Hence, we also studied the impact and trends of monthly premiums and MOOP across the US states. Furthermore, adults aged 35 to 64 are the primary market for life insurance. In the United States, an estimated 62.4% of total revenue for life insurance and annuities carriers is generated from individuals in this age group, with a similar trend expected throughout the rest of the world. Any increase in the number of individuals in this age bracket tends to increase revenue for the industry, and the global number of adults aged 35 to 64 is expected to increase in 2017. This trend raises several questions about unbalanced premium rate distribution, with insurers leveraging the age of individuals to mould their business models and plans to boost profits. Thus, in our project, we have endeavored to address such questions and possible shortcomings currently existing in the US health insurance marketplace.

3 Dataset Explanation

The dataset used is the Health Insurance Marketplace dataset, which is compiled by the US government (Centers for Medicare & Medicaid Services) and provided as an open dataset. It consists of 5 files:

  1. BusinessRules.csv: Contains the basic rules of the health insurance business that must be followed and verified for a policy to be valid.
  2. BenefitsCostSharing.csv: Outlines the sharing of medical costs between the insurance company and the policy holder.
  3. PlanAttributes.csv: Contains the attributes and benefits offered by the insurance plans. This file has the most sparsity.
  4. Network.csv: Links the issuer ID to the name of the company and the network in which it operates.
  5. Rate.csv: Contains all the details regarding the rate of a policy based on factors such as tobacco consumption, number of dependents, etc.

The dataset was very messy in its raw form and needed substantial cleaning and imputation before it could be used. It had many missing values, ontological anomalies, extreme values (such as INT_MAX, 999999, and zeroes), white space, interval versus single-value data, duplicated rows, and more.

Most of the categorical data was given as strings and needed to be mapped to an integer data type so that it could be used in neural networks for prediction.
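As a minimal sketch (with an illustrative column, not the exact marketplace fields), such a mapping can be done with as.integer(factor(...)), which assigns each distinct string level an integer code:

# Hypothetical example: encode string categories as integer codes.
# Each distinct factor level maps to 1, 2, 3, ... (alphabetical order).
df <- data.frame(Tobacco = c("Tobacco User/Non-Tobacco User",
                             "No Preference", "No Preference"))
df$TobaccoInt <- as.integer(factor(df$Tobacco))
# "No Preference" -> 1, "Tobacco User/Non-Tobacco User" -> 2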

4 Importing and cleaning the data

The variables in the dataset are as follows:

  • BusinessYear: Years (2014, 2015, 2016)
  • StateCode: State codes of the USA
  • IssuerId: Identification number of the issuer
  • SourceName: Categorical identifier of the source of data import (HIOS, SERFF, OPM)
  • VersionNum: Integer value for the version of the data import
  • ImportDate: Date of data import
  • IssuerId2: Five-digit numeric code that identifies the issuer organization in HIOS
  • FederalTIN: Tax ID Number of the issuer
  • RateEffectiveDate: Date that the foundation insurance plan base rate started
  • RateExpirationDate: Date that the foundation insurance plan base rate stopped being used
  • PlanId: Identifies an insurance plan within HIOS
  • RatingAreaId: Identifier for the geographic rating area within a state
  • Tobacco: Indicator of whether a subscriber’s tobacco use is used to determine rate eligibility
  • Age: Categorical indicator of whether a subscriber’s age is used to determine rate eligibility
  • IndividualRate: Dollar value for the insurance premium cost
  • IndividualTobaccoRate: Dollar value for the insurance premium cost applicable to a tobacco user for the insurance plan
  • Couple: Dollar value for the insurance premium cost applicable to the primary enrollee plus a secondary subscriber for the insurance plan
  • PrimarySubscriberAndOneDependent: Dollar value for the insurance premium cost applicable to the primary enrollee plus one dependent
  • PrimarySubscriberAndTwoDependents: Dollar value for the insurance premium cost applicable to the primary enrollee plus two dependents
  • PrimarySubscriberAndThreeOrMoreDependents: Dollar value for the insurance premium cost applicable to the primary enrollee plus three or more dependents
  • CoupleAndOneDependent: Dollar value for the insurance premium cost applicable to the primary enrollee plus a secondary subscriber and one dependent
  • CoupleAndTwoDependents: Dollar value for the insurance premium cost applicable to the primary enrollee plus a secondary subscriber and two dependents
  • CoupleAndThreeOrMoreDependents: Dollar value for the insurance premium cost applicable to the primary enrollee plus a secondary subscriber and three or more dependents
  • RowNumber: Integer value for the template row number associated with this data record

Import the rate data ("Rate.csv") from the overall dataset, clean it, and save the result into the R data frame named “rate”.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(choroplethr)
## Loading required package: acs
## Loading required package: stringr
## Loading required package: XML
## 
## Attaching package: 'acs'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:base':
## 
##     apply
library(choroplethrMaps)
library(ggplot2)
library(mapproj)
## Loading required package: maps
library(maps)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(ggthemes)
library(magrittr)
library(dplyr)
library(reshape2)
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
## Loading required package: grid
library(C50)
library(caret)
## Loading required package: lattice
library(rpart)
library(Boruta)
## Loading required package: ranger
library(neuralnet) 
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
library(ggplot2)
library(choroplethr)
library(RColorBrewer)
library(pals)
library(scales)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:acs':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
library(DT)


#Read in the rate.csv
ratedata=read.csv("Rate.csv", header = TRUE)
ratecopy=ratedata

# Drop exact duplicate rows (safe even when no duplicates exist,
# unlike -which(duplicated(...)), which drops everything in that case)
new_df <- ratedata[!duplicated(ratedata), ]

# Manually remove outliers

# Drop sentinel/extreme IndividualRate values in one pass; %in% avoids the
# -which(...) pitfall of returning zero rows when nothing matches
cleanrate = new_df[!(new_df$IndividualRate %in% c(0, 0.01, 9999.00, 9999.99, 999999)), ]

boxplot(cleanrate$IndividualRate, main = "Boxplot: IndividualRate for non-tobacco users")

ratedata=cleanrate

# imputing values in rate columns (other than individual rate)
ratedata$IndividualTobaccoRate[is.na(ratedata$IndividualTobaccoRate)]=ratedata$IndividualRate[is.na(ratedata$IndividualTobaccoRate)]

ratedata$Couple[is.na(ratedata$Couple)]=ratedata$IndividualRate[is.na(ratedata$Couple)]
ratedata$PrimarySubscriberAndOneDependent[is.na(ratedata$PrimarySubscriberAndOneDependent)] =ratedata$IndividualRate[is.na(ratedata$PrimarySubscriberAndOneDependent)]
ratedata$PrimarySubscriberAndTwoDependents[is.na(ratedata$PrimarySubscriberAndTwoDependents)] =ratedata$IndividualRate[is.na(ratedata$PrimarySubscriberAndTwoDependents)]
ratedata$PrimarySubscriberAndThreeOrMoreDependents[is.na(ratedata$PrimarySubscriberAndThreeOrMoreDependents)]=ratedata$IndividualRate[is.na(ratedata$PrimarySubscriberAndThreeOrMoreDependents)]
ratedata$CoupleAndOneDependent[is.na(ratedata$CoupleAndOneDependent)]=ratedata$IndividualRate[is.na(ratedata$CoupleAndOneDependent)]
ratedata$CoupleAndTwoDependents[is.na(ratedata$CoupleAndTwoDependents)]=ratedata$IndividualRate[is.na(ratedata$CoupleAndTwoDependents)]
ratedata$CoupleAndThreeOrMoreDependents[is.na(ratedata$CoupleAndThreeOrMoreDependents)]=ratedata$IndividualRate[is.na(ratedata$CoupleAndThreeOrMoreDependents)]

#write.csv(ratedata,"final_ratedata.csv")
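The repeated imputation pattern above can be written more compactly; a sketch assuming dplyr 1.0+ (for across() and coalesce()):

# Equivalent, more compact imputation: fill NAs in each family-rate
# column with the corresponding IndividualRate value.
fam_cols <- c("IndividualTobaccoRate", "Couple",
              "PrimarySubscriberAndOneDependent",
              "PrimarySubscriberAndTwoDependents",
              "PrimarySubscriberAndThreeOrMoreDependents",
              "CoupleAndOneDependent", "CoupleAndTwoDependents",
              "CoupleAndThreeOrMoreDependents")
ratedata <- ratedata %>%
  mutate(across(all_of(fam_cols), ~ coalesce(.x, IndividualRate)))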

Assign the cleaned rate data to a new working data frame.

rate <- ratedata

Extract data from the rate file individually for each of the three years 2014, 2015, and 2016.

rate_2014 <- rate[which(rate$BusinessYear=='2014' & rate$IndividualRate<9000),]
rate_2015 <- rate[which(rate$BusinessYear=='2015' & rate$IndividualRate<9000),]
rate_2016 <- rate[which(rate$BusinessYear=='2016' & rate$IndividualRate<9000),]

5 Exploratory Data Analysis

We explore the dataset as the first step to get an insight into how the variables relate to one another. We want to understand different distributions of each variable, and explore the trend of plan rates from 2014 to 2016. We also explore correlations between plan rates and the states in which plans are offered.

5.1 State vs. Number of plans analysis

Prepare a dataset for the year 2014 grouped by StateCode, counting the number of unique plans in each state. We use group_by and summarise from dplyr to aggregate, and repeat this analysis for each year.

stateplans_2014 <- rate_2014 %>% 
  group_by(StateCode) %>%
  summarise(NumberofPlans = n_distinct(PlanId))
# Map each state to a different fill color
ggplot(data=stateplans_2014, aes(x=StateCode, y=NumberofPlans, fill=StateCode)) +
    geom_bar(stat="identity")

stateplans_2015 <- rate_2015 %>% 
  group_by(StateCode) %>%
  summarise(NumberofPlans = n_distinct(PlanId))
ggplot(data=stateplans_2015, aes(x=StateCode, y=NumberofPlans, fill=StateCode)) +
    geom_bar(stat="identity")

stateplans_2016 <- rate_2016 %>% 
  group_by(StateCode) %>%
  summarise(NumberofPlans = n_distinct(PlanId))
ggplot(data=stateplans_2016, aes(x=StateCode, y=NumberofPlans, fill=StateCode)) +
    geom_bar(stat="identity")

5.2 State vs. plan rates analysis

Create a map showing median plan rates in each state.

data(state.regions)

stateRates <- rate %>%
    filter(IndividualRate<9000) %>%
    group_by(StateCode) %>%
    summarise(MeanIndividualRate = mean(IndividualRate),
              value = median(IndividualRate)) %>%
    inner_join(state.regions, by=c("StateCode"="abb"))
## Warning: Column `StateCode`/`abb` joining factor and character vector,
## coercing into character vector
state_choropleth(stateRates,
                 title="Median Health Insurance Plan Individual Rate ($ / month)",
                 num_colors=1)
## Warning in self$bind(): The following regions were missing and are being
## set to NA: minnesota, california, maryland, colorado, washington, vermont,
## kentucky, massachusetts, connecticut, new york, rhode island, district of
## columbia

The above map shows that states like Wisconsin, Wyoming, New Jersey, and South Carolina have the highest median individual health premium rates.

5.3 State vs. plan rate for different tobacco users

Create a map showing the mean rate difference between tobacco users and non-users.

#install.packages("mapproj")

# Note: compare IndividualRate against the number 9000, not the string "9000"
rate_ind_tobacco = subset(rate, (rate$Age != "Family Option" & rate$IndividualRate < 9000 & rate$BusinessYear == "2015" & rate$Tobacco != "No Preference"))
bystate <- rate_ind_tobacco %>%
            select(StateCode, IndividualRate, IndividualTobaccoRate)
bystatecount<-bystate %>% 
              group_by(StateCode) %>%
              summarize(mean_non_Tobacco_Rate = mean(IndividualRate),
                        mean_Tobacco_Rate = mean(IndividualTobaccoRate),
                        mean_Diff= (mean_Tobacco_Rate - mean_non_Tobacco_Rate)) %>%
              arrange(desc(mean_Diff))

bystatecount$region <- tolower(state.name[match(bystatecount$StateCode,state.abb)])


us_state_map = map_data('state')
statename<-group_by(us_state_map, region) %>% 
  summarise(long = mean(long), lat = mean(lat))

mapdata <- left_join(bystatecount, us_state_map, by="region")

p <- ggplot()+geom_polygon(data=mapdata, aes(x=long, y=lat, group = group, fill=mean_Diff),colour="white")+
  scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar")
p1 <- p+theme_bw()+
  labs(fill = "Mean Difference $/mon",
  title = "Mean Rate Difference Between Tobacco Users and Non-Tobacco Users", x="", y="")
p2<-p1 + scale_y_continuous(breaks=c()) + scale_x_continuous(breaks=c()) + theme(panel.border =  element_blank())
p3<-p2+geom_text(data = statename, aes(x=long, y=lat, label=region),  na.rm = T, size=2)+
  coord_map()
p3

The above plot shows that the greatest difference in mean rates between tobacco users and non-users exists in the state of Wyoming. This means there is a large difference in the monthly premium rate depending on whether or not the user opts for a tobacco plan.

Prepare the data for further analysis. Here we explore the rates for the year 2014.

rate_nofam = subset(rate, (rate$Age != "Family Option" & rate$BusinessYear == "2014"))
rate_fam = subset(rate, (rate$Age == "Family Option" & rate$BusinessYear == "2014"))

rate_final_2015 = subset(rate, BusinessYear == "2015")
rate_final_2014 = subset(rate, BusinessYear == "2014")

IndividualOption <- subset(rate_final_2015, (rate_final_2015$Age != "Family Option" & rate_final_2015$IndividualRate < 9000),
                             select = c(BusinessYear:IndividualTobaccoRate))

6 Carriers vs Plans by state

The following analysis shows the relationship between plan rates and their carriers, grouped by state.

bystate <- IndividualOption %>%
            select(StateCode, IssuerId, PlanId, IndividualRate)
bystatecount<-bystate %>% 
              group_by(StateCode) %>%
              summarize(Carriers = length(unique(IssuerId)), 
                        PlanOffered = length(unique(PlanId)),
                        MeanIndRate= mean(IndividualRate),
                        MedianIndRate = median(IndividualRate)) %>%
              arrange(desc(PlanOffered))
head(bystatecount)
## # A tibble: 6 x 5
##   StateCode Carriers PlanOffered MeanIndRate MedianIndRate
##   <fct>        <int>       <int>       <dbl>         <dbl>
## 1 WI              32         834        467.         415. 
## 2 TX              33         716        262.         233. 
## 3 FL              29         621        308.         310. 
## 4 OH              33         538        256.          39.7
## 5 PA              32         538        328.         302. 
## 6 IL              21         466        391.         352.
##Graph1 - Carriers vs. Plan Available By State
bystatecount$Carriers <- ifelse(bystatecount$Carriers<15, "(0,15)", ifelse(bystatecount$Carriers>=25, "[25,35)","[15,25)"))
a <- ggplot(bystatecount, aes(x=reorder(StateCode, PlanOffered), y=PlanOffered), height = 40, width = 40)+
  geom_bar(aes(fill = Carriers), stat="identity")+coord_flip()+
  ggtitle("Carriers vs. Plans Available By State")+
  labs(x="State", y="Plans Available")
a

6.1 Individual plan rates vs. Age

In this analysis, we explore the relationship between Individual plan rates and Age.

#install.packages("magrittr")
rate_summary_individual_age<-rate_nofam %>% 
  group_by(Age) %>%
  summarize(Number_of_Providers = length(unique(IssuerId)), 
            Number_of_Plans_Provided = length(unique(PlanId)),
            Average_Individual_Rate = mean(IndividualRate),
            MedianIndRate = median(IndividualRate),
            Standard_deviation_Individual_Rate= sd(IndividualRate),
            Average_Individual_Rate_Tobacco = mean(IndividualTobaccoRate,na.rm=TRUE),
            Per_state_Population=length(StateCode)) %>%
  arrange(desc(Per_state_Population))

rate_summary_family<-rate_fam %>% 
  group_by(Age) %>%
  summarize(Number_of_Providers = length(unique(IssuerId)), 
            Number_of_Plans_Provided = length(unique(PlanId)),
            Average_Individual_Rate = mean(IndividualRate),
            MedianIndRate = median(IndividualRate),
            Standard_deviation_Individual_Rate= sd(IndividualRate),
            Average_Individual_Rate_Tobacco = mean(IndividualTobaccoRate,na.rm=TRUE),
            Per_state_Population=length(StateCode)) %>%
  arrange(desc(Per_state_Population))

rate_summary_plan<-rate_nofam %>% 
  group_by(IssuerId) %>%
  summarize( 
            Number_of_Plans_Provided = length(unique(PlanId)),
            Average_Individual_Rate = mean(IndividualRate),
            Average_fam2_Rate = mean(CoupleAndOneDependent),
            MedianIndRate = median(IndividualRate),
            Standard_deviation_Individual_Rate = sd(IndividualRate),
            Average_Individual_Rate_Tobacco = mean(IndividualTobaccoRate,na.rm=TRUE),
            Per_state_Population=length(StateCode)) %>%
  arrange(desc(Per_state_Population))
ggplot(rate_summary_individual_age, aes(x=Age, y=Average_Individual_Rate))+
  geom_bar(aes(fill = Age), stat="identity")+coord_flip()+ 
  ggtitle("Individual Rate per Age")+
  labs(x="Age", y="Average Rate")

In general, individual premium plan rates trend upward as the age of the subscriber increases, which is quite expected.

6.2 Average rates per category of state

Here we explore the average rates of each family plan category across the states.

family_rate <- subset(rate, (rate$Age == "Family Option"))
Family_summary<-family_rate %>% 
  group_by(StateCode) %>%
  summarize(Number_of_Providers = length(unique(IssuerId)), 
            Number_of_Plans_Provided = length(unique(PlanId)),
            Average_Individual_Rate = mean(IndividualRate),
            MedianIndRate = median(IndividualRate),
            Standard_deviation_Individual_Rate= sd(IndividualRate),
            Average_Couple_Rate = mean(Couple),
            MedianInd_Couple_Rate = median(Couple),
            Standard_deviation_Couple_Rate= sd(Couple),
            Average_PrimarySubscriberAndOneDependent = mean(PrimarySubscriberAndOneDependent),
            MedianInd_PrimarySubscriberAndOneDependent = median(PrimarySubscriberAndOneDependent),
            Standard_deviation_PrimarySubscriberAndOneDependent= sd(PrimarySubscriberAndOneDependent),
            Average_PrimarySubscriberAndTwoDependents = mean(PrimarySubscriberAndTwoDependents),
            MedianInd_PrimarySubscriberAndTwoDependent = median(PrimarySubscriberAndTwoDependents),
            Standard_deviation_PrimarySubscriberAndTwoDependent= sd(PrimarySubscriberAndTwoDependents),
            Average_PrimarySubscriberAndThreeOrMoreDependents = mean(PrimarySubscriberAndThreeOrMoreDependents),
            MedianInd_PrimarySubscriberAndThreeOrMoreDependents = median(PrimarySubscriberAndThreeOrMoreDependents),
            Standard_deviation_PrimarySubscriberAndThreeOrMoreDependents= sd(PrimarySubscriberAndThreeOrMoreDependents),
            Average_CoupleAndOneDependent= mean(CoupleAndOneDependent),
            MedianInd_CoupleAndOneDependent = median(CoupleAndOneDependent),
            Standard_deviation_CoupleAndOneDependent= sd(CoupleAndOneDependent),
            Average_CoupleAndTwoDependents = mean(CoupleAndTwoDependents),
            MedianInd_CoupleAndTwoDependents = median(CoupleAndTwoDependents),
            Standard_deviation_CoupleAndTwoDependents= sd(CoupleAndTwoDependents),
            Average_CoupleAndThreeOrMoreDependents = mean(CoupleAndThreeOrMoreDependents),
            MedianInd_CoupleAndThreeOrMoreDependents = median(CoupleAndThreeOrMoreDependents),
            Standard_deviation_CoupleAndThreeOrMoreDependents= sd(CoupleAndThreeOrMoreDependents)
            ) %>%
  arrange(desc(Number_of_Providers))


temp1<-family_rate %>% 
   group_by(StateCode) %>%
   summarize(Average_Individual_Rate = mean(IndividualRate),
             Average_Couple_Rate = mean(Couple),
             Average_PrimarySubscriberAndOneDependent = mean(PrimarySubscriberAndOneDependent),
             Average_PrimarySubscriberAndTwoDependents = mean(PrimarySubscriberAndTwoDependents),
             Average_PrimarySubscriberAndThreeOrMoreDependents = mean(PrimarySubscriberAndThreeOrMoreDependents),
             Average_CoupleAndTwoDependents = mean(CoupleAndTwoDependents),
             Average_CoupleAndThreeOrMoreDependents = mean(CoupleAndThreeOrMoreDependents)
   ) %>%
   arrange(desc(Average_Individual_Rate))


f1 <- ggplot(Family_summary, aes(x=StateCode, y=Average_Couple_Rate))+
  geom_bar(aes(fill = StateCode), stat="identity")+
  ggtitle("Average Couple Rate per State")+
  labs(x="State", y="Rates")
 f2 <- ggplot(Family_summary, aes(x=StateCode, y=Average_PrimarySubscriberAndOneDependent))+
   geom_bar(aes(fill = StateCode), stat="identity")+
   ggtitle("Average Primary +1 dependent Rate per State")+
   labs(x="State", y="Rates")
 f3 <- ggplot(Family_summary, aes(x=StateCode, y=Average_PrimarySubscriberAndTwoDependents))+
   geom_bar(aes(fill = StateCode), stat="identity")+
   ggtitle("Average Primary + 2 dependent Rate per State")+
   labs(x="State", y="Rates")
 f4 <- ggplot(Family_summary, aes(x=StateCode, y=Average_PrimarySubscriberAndThreeOrMoreDependents))+
   geom_bar(aes(fill = StateCode), stat="identity")+
   ggtitle("Average Primary + 3 or more dependent Rate per State")+
   labs(x="State", y="Rates")
 f5 <- ggplot(Family_summary, aes(x=StateCode, y=Average_CoupleAndOneDependent))+
   geom_bar(aes(fill = StateCode), stat="identity")+
   ggtitle("Average Couple Rate +1 dependent per State")+
   labs(x="State", y="Rates")
 f6 <- ggplot(Family_summary, aes(x=StateCode, y=Average_CoupleAndTwoDependents))+
   geom_bar(aes(fill = StateCode), stat="identity")+
   ggtitle("Average Couple Rate +2 dependent per State")+
   labs(x="State", y="Rates")
 f7 <- ggplot(Family_summary, aes(x=StateCode, y=Average_CoupleAndThreeOrMoreDependents))+
   geom_bar(aes(fill = StateCode), stat="identity")+
   ggtitle("Average Couple Rate +3 or more dependent per State")+
   labs(x="State", y="Rates")
 
 temp<- melt(temp1,id.var="StateCode")
ggplot(temp, aes(x = StateCode, y = value, fill = variable)) +
   geom_bar(stat = "identity")+coord_flip()+
   ggtitle("Average Rates per category for States")+
   labs(x="State", y="Rates")

The above plot shows the distribution of average plan rates for each category of plan (individual, couple, primary subscriber and one dependent, etc.) per state.

7 Classification

7.1 Rules Associations

Here, we aim to explore the associations between different rules while considering which benefits are associated with which plan. These associations help us identify which rules are the most important and appear in the most plans.

b_rules <- read.csv("BusinessRules.csv", na.strings=c("","NA"))
bru<- subset( b_rules, select = -c(BusinessYear, ProductId, StandardComponentId, ImportDate, IssuerId2,VersionNum,TIN, RowNumber, IssuerId ) )
bru<-bru[which(bru$SourceName!="HIOS"),]

Measuring rule importance using support and confidence

Support and confidence are the two criteria to help us decide whether a pattern is “interesting”. By setting thresholds for these two criteria, we can easily limit the number of interesting rules or item-sets reported.

For item-sets \(X\) and \(Y\), the support of an item-set measures how frequently it appears in the data: \[support(X)=\frac{count(X)}{N},\] where \(N\) is the total number of transactions in the database and \(count(X)\) is the number of observations (transactions) containing the item-set \(X\). Of course, the union of item-sets is an item-set itself, i.e., if \(Z=\{X,Y\}\), then \[support(Z)=support(X,Y).\]

For a rule \(X \rightarrow Y\), the rule's confidence measures the relative accuracy of the rule: \[confidence(X \rightarrow Y)=\frac{support(X, Y)}{support(X)}\]

This measures the joint occurrence of X and Y over the X domain. If whenever X appears Y tends to be present too, we will have a high \(confidence(X\rightarrow Y)\). The ranges of the support and confidence are \(0 \leq support,\ confidence \leq 1\).

Lift measures the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random-choice targeting model: \[lift(X\rightarrow Y)=\frac{confidence(X\rightarrow Y)}{support(Y)}\]
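As a toy illustration (not the CMS data), the three measures can be computed directly in base R for a rule \(X \rightarrow Y\) over five transactions:

# Five toy transactions; X appears in 4, Y in 4, {X,Y} together in 3
transactions <- list(c("X","Y"), c("X","Y"), c("X"), c("Y"), c("X","Y"))
N <- length(transactions)
supp_X  <- sum(sapply(transactions, function(t) "X" %in% t)) / N              # 0.8
supp_Y  <- sum(sapply(transactions, function(t) "Y" %in% t)) / N              # 0.8
supp_XY <- sum(sapply(transactions, function(t) all(c("X","Y") %in% t))) / N  # 0.6
conf <- supp_XY / supp_X  # 0.75
lift <- conf / supp_Y     # 0.9375

A lift below 1, as here, means X and Y co-occur slightly less often than expected if they were independent; the rules reported later all have lift above 1.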

 #apriori

b_rules1<- apriori(bru, parameter = list(minlen=2, supp=0.7, conf=0.7))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5     0.7      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 5835 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[304 item(s), 8336 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [302 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
b_set<-sort(b_rules1, by="lift")

We use the arulesViz package to visualize a scatter plot of the rules (support vs. confidence, shaded by lift).

plot(b_set)

plot(head(sort(b_set, by = "lift"), n=50), method = "graph", control=list(cex=.5))

The plot above represents the association rules as a graph network, showing only the first 50 associations by lift.

inspect(b_set[1:3])
##     lhs                                             rhs                                      support confidence     lift count
## [1] {SourceName=SERFF,                                                                                                        
##      SameSexPartnerAsSpouseIndicator=Yes}        => {DomesticPartnerAsSpouseIndicator=Yes} 0.7136516  0.9815212 1.296460  5949
## [2] {SourceName=SERFF,                                                                                                        
##      SameSexPartnerAsSpouseIndicator=Yes,                                                                                     
##      AgeDeterminationRule=Age on effective date} => {DomesticPartnerAsSpouseIndicator=Yes} 0.7048944  0.9812959 1.296163  5876
## [3] {DomesticPartnerAsSpouseIndicator=Yes}       => {SameSexPartnerAsSpouseIndicator=Yes}  0.7521593  0.9935034 1.296063  6270

The top association rule is {SourceName=SERFF, SameSexPartnerAsSpouseIndicator=Yes} => {DomesticPartnerAsSpouseIndicator=Yes}, with support 0.7136516, confidence 0.9815212, and lift 1.296460.

7.2 Decision Trees

Since we do not have predefined rules for classifying the dataset, we use decision trees on the data. We compare the performance of the rpart and C5.0 algorithms by validating the models on held-out test data (20%).

set.seed(1234)
train_index <- sample(seq_len(nrow(bru)), size = 0.8*nrow(bru))
bru_train<-bru[train_index, ]
bru_test<-bru[-train_index, ]
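For the rpart side of the comparison, a minimal sketch (class method, default settings) fits a classification tree with the same predictors on the same training split and predicts on the held-out 20%:

library(rpart)
# Sketch: rpart classification tree on the same formula as the C5.0 model
rpart_tree <- rpart(DentalOnlyPlan ~ SourceName + AgeDeterminationRule +
                      EnrolleeContractRateDeterminationRule +
                      ChildrenOnlyContractMaxChildrenRule +
                      MinimumTobaccoFreeMonthsRule +
                      TwoParentFamilyMaxDependentsRule +
                      DomesticPartnerAsSpouseIndicator +
                      SingleParentFamilyMaxDependentsRule +
                      SameSexPartnerAsSpouseIndicator + MarketCoverage,
                    data = bru_train, method = "class")
rpart_pred <- predict(rpart_tree, bru_test, type = "class")
confusionMatrix(table(rpart_pred, bru_test$DentalOnlyPlan))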

Create the C5.0 decision tree model and predict on the test set.

C50_dec_tree <- C5.0(DentalOnlyPlan ~ SourceName + AgeDeterminationRule +
                       EnrolleeContractRateDeterminationRule +
                       ChildrenOnlyContractMaxChildrenRule +
                       MinimumTobaccoFreeMonthsRule +
                       TwoParentFamilyMaxDependentsRule +
                       DomesticPartnerAsSpouseIndicator +
                       SingleParentFamilyMaxDependentsRule +
                       SameSexPartnerAsSpouseIndicator + MarketCoverage,
                     data = bru_train, na.action = na.omit)
summary(C50_dec_tree)
## 
## Call:
## C5.0.formula(formula = DentalOnlyPlan ~ SourceName +
##  + SingleParentFamilyMaxDependentsRule + SameSexPartnerAsSpouseIndicator
##  + MarketCoverage, data = bru_train, na.action = na.omit)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Apr 21 19:05:32 2018
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 6356 cases (11 attributes) from undefined.data
## 
## Decision tree:
## 
## MinimumTobaccoFreeMonthsRule = 60: No (0)
## MinimumTobaccoFreeMonthsRule in {12,3,4,6,999}:
## :...MarketCoverage = Individual: No (2841/39)
## :   MarketCoverage = SHOP (Small Group):
## :   :...ChildrenOnlyContractMaxChildrenRule = 3 or more: No (401/22)
## :       ChildrenOnlyContractMaxChildrenRule = 1:
## :       :...DomesticPartnerAsSpouseIndicator = No: No (21)
## :           DomesticPartnerAsSpouseIndicator = Yes: Yes (94/37)
## MinimumTobaccoFreeMonthsRule = Not Applicable:
## :...SourceName = HIOS: Yes (0)
##     SourceName = OPM: No (35)
##     SourceName = SERFF: [S1]
## 
## SubTree [S1]
## 
## EnrolleeContractRateDeterminationRule = There are rates specifically for couples and for families (not just addition of individual rates): Yes (462)
## EnrolleeContractRateDeterminationRule = A different rate (specifically for parties of two or more)for each enrollee is added together:
## :...ChildrenOnlyContractMaxChildrenRule = 3 or more: Yes (2288/551)
##     ChildrenOnlyContractMaxChildrenRule = 1:
##     :...TwoParentFamilyMaxDependentsRule in {1,2}: Yes (15)
##         TwoParentFamilyMaxDependentsRule = 3 or more: No (199/63)
## 
## 
## Evaluation on training data (6356 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       9  712(11.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    3373   588    (a): class No
##     124  2271    (b): class Yes
## 
## 
##  Attribute usage:
## 
##  100.00% MinimumTobaccoFreeMonthsRule
##   52.82% MarketCoverage
##   47.48% ChildrenOnlyContractMaxChildrenRule
##   47.18% SourceName
##   46.63% EnrolleeContractRateDeterminationRule
##    3.37% TwoParentFamilyMaxDependentsRule
##    1.81% DomesticPartnerAsSpouseIndicator
## 
## 
## Time: 0.0 secs
# See docs for predict # ?C50::predict.C5.0
c50_pred<-predict(C50_dec_tree, bru_test)  

confusionMatrix(table(c50_pred, bru_test$DentalOnlyPlan))
## Confusion Matrix and Statistics
## 
##         
## c50_pred  No Yes
##      No  882  40
##      Yes 152 594
##                                           
##                Accuracy : 0.8849          
##                  95% CI : (0.8686, 0.8998)
##     No Information Rate : 0.6199          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7638          
##  Mcnemar's Test P-Value : 1.14e-15        
##                                           
##             Sensitivity : 0.8530          
##             Specificity : 0.9369          
##          Pos Pred Value : 0.9566          
##          Neg Pred Value : 0.7962          
##              Prevalence : 0.6199          
##          Detection Rate : 0.5288          
##    Detection Prevalence : 0.5528          
##       Balanced Accuracy : 0.8950          
##                                           
##        'Positive' Class : No              
## 

We see that the accuracy of the C5.0 model is 0.8849 and the kappa value is 0.7638.
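The headline figures can also be pulled programmatically from caret's confusionMatrix object rather than read off the printout (a small sketch, reusing `c50_pred` and `bru_test` from above):

```r
# Store the confusionMatrix result and extract the overall statistics
cm <- confusionMatrix(table(c50_pred, bru_test$DentalOnlyPlan))
cm$overall[c("Accuracy", "Kappa")]  # named numeric vector: 0.8849, 0.7638
```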

Create the rpart decision tree model and predict:

set.seed(123)
bru_rpart_model<-rpart(DentalOnlyPlan ~ ., data = bru_train, cp = 0.01)
# cp is rpart's *complexity parameter*; 0.01 is the default
bru_rpart_pred<-predict(bru_rpart_model, bru_test, type = 'class')
confusionMatrix(table(bru_rpart_pred, bru_test$DentalOnlyPlan))
## Confusion Matrix and Statistics
## 
##               
## bru_rpart_pred  No Yes
##            No  987  39
##            Yes  47 595
##                                           
##                Accuracy : 0.9484          
##                  95% CI : (0.9367, 0.9586)
##     No Information Rate : 0.6199          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8909          
##  Mcnemar's Test P-Value : 0.4504          
##                                           
##             Sensitivity : 0.9545          
##             Specificity : 0.9385          
##          Pos Pred Value : 0.9620          
##          Neg Pred Value : 0.9268          
##              Prevalence : 0.6199          
##          Detection Rate : 0.5917          
##    Detection Prevalence : 0.6151          
##       Balanced Accuracy : 0.9465          
##                                           
##        'Positive' Class : No              
## 

We see that the accuracy of the rpart model is 0.9484 and the kappa value is 0.8909. We conclude that rpart outperforms the C5.0 model for classifying the business rules.

7.3 Pruning

To find the best complexity parameter, we prune the tree to the optimal cp, the complexity parameter to which the rpart object will be trimmed. Instead of using the resubstitution error (e.g., \(1-R^2\), RMSE) to capture the discrepancy between the observed labels and the model-predicted labels, we use xerror, which averages the discrepancy between observed and predicted classifications over cross-validation folds.
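Below we select the cp that minimizes xerror. A common alternative is the one-standard-error rule, which prefers the smallest tree whose xerror is within one standard error of the minimum. A sketch (`cp_one_se` is an illustrative helper, not part of the analysis):

```r
# One-standard-error rule for choosing rpart's cp from the cptable
# (columns: CP, nsplit, rel error, xerror, xstd).
cp_one_se <- function(fit) {
  tab <- fit$cptable
  best <- which.min(tab[, "xerror"])
  thresh <- tab[best, "xerror"] + tab[best, "xstd"]
  # first (i.e., smallest) tree whose cross-validated error is within one SE of the minimum
  tab[which(tab[, "xerror"] <= thresh)[1], "CP"]
}
```

This tends to yield a smaller, more interpretable tree at a modest cost in cross-validated error.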

set.seed(12345)
control = rpart.control(cp = 0.000, xval = 100, minsplit = 2)
bru_rpart_ctrl_model= rpart(DentalOnlyPlan ~ ., data = bru_train, control = control)
plotcp(bru_rpart_ctrl_model)

printcp(bru_rpart_ctrl_model)
## 
## Classification tree:
## rpart(formula = DentalOnlyPlan ~ ., data = bru_train, control = control)
## 
## Variables actually used in tree construction:
## [1] CohabitationRule                     
## [2] DependentMaximumAgRule               
## [3] DomesticPartnerAsSpouseIndicator     
## [4] EnrolleeContractRateDeterminationRule
## [5] MarketCoverage                       
## [6] MinimumTobaccoFreeMonthsRule         
## [7] SourceName                           
## [8] StateCode                            
## 
## Root node error: 2657/6668 = 0.39847
## 
## n= 6668 
## 
##            CP nsplit rel error   xerror      xstd
## 1  0.74858863      0  1.000000 1.000000 0.0150464
## 2  0.05005645      1  0.251411 0.255928 0.0093005
## 3  0.03274370      2  0.201355 0.205871 0.0084336
## 4  0.02070004      3  0.168611 0.171622 0.0077573
## 5  0.01561912      4  0.147911 0.153933 0.0073744
## 6  0.00940911      7  0.094091 0.103500 0.0061112
## 7  0.00301091      8  0.084682 0.096726 0.0059162
## 8  0.00225819     10  0.078660 0.094467 0.0058494
## 9  0.00213273     11  0.076402 0.088822 0.0056786
## 10 0.00188182     14  0.070004 0.086187 0.0055968
## 11 0.00112909     16  0.066240 0.082047 0.0054654
## 12 0.00100364     18  0.063982 0.080542 0.0054167
## 13 0.00075273     21  0.060971 0.079413 0.0053798
## 14 0.00065864     23  0.059466 0.075273 0.0052422
## 15 0.00056455     27  0.056831 0.075273 0.0052422
## 16 0.00037636     29  0.055702 0.071133 0.0051003
## 17 0.00031364     34  0.053820 0.068122 0.0049943
## 18 0.00025091     40  0.051938 0.068122 0.0049943
## 19 0.00018818     43  0.051186 0.067746 0.0049808
## 20 0.00000000     47  0.050433 0.064358 0.0048581
set.seed(1234)
selected_tr <- prune(bru_rpart_ctrl_model, cp= bru_rpart_ctrl_model$cptable[which.min(bru_rpart_ctrl_model$cptable[,"xerror"]),"CP"])
bru_pred_tune<-predict(selected_tr, bru_test,type = 'class')
confusionMatrix(table(bru_pred_tune, bru_test$DentalOnlyPlan))
## Confusion Matrix and Statistics
## 
##              
## bru_pred_tune   No  Yes
##           No  1008   28
##           Yes   26  606
##                                          
##                Accuracy : 0.9676         
##                  95% CI : (0.958, 0.9756)
##     No Information Rate : 0.6199         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9313         
##  Mcnemar's Test P-Value : 0.8918         
##                                          
##             Sensitivity : 0.9749         
##             Specificity : 0.9558         
##          Pos Pred Value : 0.9730         
##          Neg Pred Value : 0.9589         
##              Prevalence : 0.6199         
##          Detection Rate : 0.6043         
##    Detection Prevalence : 0.6211         
##       Balanced Accuracy : 0.9653         
##                                          
##        'Positive' Class : No             
## 

After pruning, the accuracy of the rpart model increased to 0.9676 and the kappa value increased to 0.9313. We are thus able to build a fairly accurate classification model with the rpart algorithm.

8 Feature Selection

8.1 Pre-processing : Clean and prepare data for feature selection.

Recode Age, Tobacco, and SourceName to numeric codes and strip the "Rating Area " prefix from RatingAreaId. Also, create a new column date_diff as the difference between RateExpirationDate and RateEffectiveDate.

#create smaller data set of 80000 samples
smalldata <- rate[sample(nrow(rate),80000),]
#preprocessing of smalldata
smalldata$RatingAreaId= gsub("Rating Area ", "", smalldata$RatingAreaId)
smalldata$Tobacco= gsub("No Preference", "0", smalldata$Tobacco)
smalldata$Tobacco= gsub("Tobacco User/Non-Tobacco User", "1", smalldata$Tobacco)
smalldata$Age= gsub("65 and over", "65", smalldata$Age)
smalldata$Age= gsub("0-20", "10", smalldata$Age)
smalldata$Age= gsub("Family Option", "0", smalldata$Age)
smalldata$SourceName= gsub("HIOS", "0", smalldata$SourceName)
smalldata$SourceName= gsub("OPM", "1", smalldata$SourceName)
smalldata$SourceName= gsub("SERFF", "2", smalldata$SourceName)
smalldata$date_diff <- as.Date(as.character(smalldata$RateExpirationDate), format="%Y-%m-%d")-
  as.Date(as.character(smalldata$RateEffectiveDate), format="%Y-%m-%d")
smalldata$FederalTIN= gsub("-", "", smalldata$FederalTIN)
colnames(smalldata)
##  [1] "BusinessYear"                             
##  [2] "StateCode"                                
##  [3] "IssuerId"                                 
##  [4] "SourceName"                               
##  [5] "VersionNum"                               
##  [6] "ImportDate"                               
##  [7] "IssuerId2"                                
##  [8] "FederalTIN"                               
##  [9] "RateEffectiveDate"                        
## [10] "RateExpirationDate"                       
## [11] "PlanId"                                   
## [12] "RatingAreaId"                             
## [13] "Tobacco"                                  
## [14] "Age"                                      
## [15] "IndividualRate"                           
## [16] "IndividualTobaccoRate"                    
## [17] "Couple"                                   
## [18] "PrimarySubscriberAndOneDependent"         
## [19] "PrimarySubscriberAndTwoDependents"        
## [20] "PrimarySubscriberAndThreeOrMoreDependents"
## [21] "CoupleAndOneDependent"                    
## [22] "CoupleAndTwoDependents"                   
## [23] "CoupleAndThreeOrMoreDependents"           
## [24] "RowNumber"                                
## [25] "date_diff"
data_for_boruta <-smalldata[ , -which(names(smalldata) %in% c("BusinessYear","ImportDate", "IssuerId2","FederalTIN","RateEffectiveDate","RateExpirationDate", "PlanId","IndividualTobaccoRate", "PrimarySubscriberAndOneDependent","PrimarySubscriberAndTwoDependents","PrimarySubscriberAndThreeOrMoreDependents","CoupleAndOneDependent","CoupleAndTwoDependents","CoupleAndThreeOrMoreDependents","RowNumber","date_diff", "Couple"))]
str(data_for_boruta)
## 'data.frame':    80000 obs. of  8 variables:
##  $ StateCode     : Factor w/ 39 levels "AK","AL","AR",..: 20 27 19 27 34 27 4 11 38 34 ...
##  $ IssuerId      : int  11512 28162 23603 28162 61315 86728 51485 36096 31274 87226 ...
##  $ SourceName    : chr  "0" "2" "2" "2" ...
##  $ VersionNum    : int  9 4 7 4 5 4 13 10 10 6 ...
##  $ RatingAreaId  : chr  "14" "12" "3" "14" ...
##  $ Tobacco       : chr  "1" "0" "0" "0" ...
##  $ Age           : chr  "44" "22" "59" "65" ...
##  $ IndividualRate: num  321.7 424 46.1 1342.6 30.1 ...

Begin feature selection

set.seed(123)
ppmi_boruta<-Boruta(IndividualRate~., data=data_for_boruta, doTrace=0)
## Growing trees.. Progress: 64%. Estimated remaining time: 17 seconds.
## Computing permutation importance.. Progress: 59%. Estimated remaining time: 21 seconds.
print(ppmi_boruta)
## Boruta performed 10 iterations in 18.04925 mins.
##  7 attributes confirmed important: Age, IssuerId, RatingAreaId,
## SourceName, StateCode and 2 more;
##  No attributes deemed unimportant.
ppmi_boruta$ImpHistory[1:6, 1:10]
##      StateCode IssuerId SourceName VersionNum RatingAreaId  Tobacco
## [1,]  116.4772 118.9919   62.05204   117.0608     40.31077 206.1937
## [2,]  124.0483 120.0399   62.95507   118.8708     42.65647 201.1707
## [3,]  124.8317 124.5641   62.66332   117.2797     41.86787 200.3221
## [4,]  121.8676 119.4734   65.23306   121.1380     40.70576 191.2189
## [5,]  122.5822 121.4917   66.88530   122.7633     43.36275 204.0690
## [6,]  123.0402 127.2608   63.23441   119.0927     42.58185 205.3681
##           Age shadowMax shadowMean shadowMin
## [1,] 394.7237 2.3047120  0.6090962 -1.960094
## [2,] 404.0171 1.0706027 -1.2910484 -3.039002
## [3,] 412.0495 0.1303121 -0.5714952 -1.742969
## [4,] 385.2483 3.2791698  1.1732179 -0.619702
## [5,] 377.3597 1.5825932 -0.2637642 -3.072958
## [6,] 409.9142 3.1580682  0.8119972 -1.209018

Plot the feature importance box plot as computed from Boruta above.

plot(ppmi_boruta, xaxt="n", xlab="")
lz<-lapply(1:ncol(ppmi_boruta$ImpHistory), function(i)
  ppmi_boruta$ImpHistory[is.finite(ppmi_boruta$ImpHistory[, i]), i])
names(lz)<-colnames(ppmi_boruta$ImpHistory)
lb<-sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(ppmi_boruta$ImpHistory), cex.axis=.8, font = 4)

Compute the tentative and confirmed variables from Boruta.

final.ppmi_boruta<-TentativeRoughFix(ppmi_boruta)
## Warning in TentativeRoughFix(ppmi_boruta): There are no Tentative
## attributes! Returning original object.
print(final.ppmi_boruta)
## Boruta performed 10 iterations in 18.04925 mins.
##  7 attributes confirmed important: Age, IssuerId, RatingAreaId,
## SourceName, StateCode and 2 more;
##  No attributes deemed unimportant.
final.ppmi_boruta$finalDecision
##    StateCode     IssuerId   SourceName   VersionNum RatingAreaId 
##    Confirmed    Confirmed    Confirmed    Confirmed    Confirmed 
##      Tobacco          Age 
##    Confirmed    Confirmed 
## Levels: Tentative Confirmed Rejected
getConfirmedFormula(final.ppmi_boruta)
## IndividualRate ~ StateCode + IssuerId + SourceName + VersionNum + 
##     RatingAreaId + Tobacco + Age
## <environment: 0x7fe4c5e05c70>
print(final.ppmi_boruta$finalDecision[final.ppmi_boruta$finalDecision %in% c("Confirmed", "Tentative")])
##    StateCode     IssuerId   SourceName   VersionNum RatingAreaId 
##    Confirmed    Confirmed    Confirmed    Confirmed    Confirmed 
##      Tobacco          Age 
##    Confirmed    Confirmed 
## Levels: Tentative Confirmed Rejected
impBoruta <- final.ppmi_boruta$finalDecision[final.ppmi_boruta$finalDecision %in% c("Confirmed")]; length(impBoruta)
## [1] 7

9 Prediction

We use a neural network to create the predictive model, trained on the data from 2014-2016.

Convert the remaining character and factor variables to numeric.

data_for_boruta$SourceName=as.numeric(data_for_boruta$SourceName)
data_for_boruta$RatingAreaId=as.numeric(data_for_boruta$RatingAreaId)
data_for_boruta$Tobacco=as.numeric(data_for_boruta$Tobacco)
data_for_boruta$Age=as.numeric(data_for_boruta$Age)

data_for_boruta$StateCode=as.numeric(data_for_boruta$StateCode)
str(data_for_boruta)
## 'data.frame':    80000 obs. of  8 variables:
##  $ StateCode     : num  20 27 19 27 34 27 4 11 38 34 ...
##  $ IssuerId      : int  11512 28162 23603 28162 61315 86728 51485 36096 31274 87226 ...
##  $ SourceName    : num  0 2 2 2 0 2 0 2 2 0 ...
##  $ VersionNum    : int  9 4 7 4 5 4 13 10 10 6 ...
##  $ RatingAreaId  : num  14 12 3 14 23 12 4 11 1 19 ...
##  $ Tobacco       : num  1 0 0 0 0 0 0 1 1 1 ...
##  $ Age           : num  44 22 59 65 53 23 50 36 39 34 ...
##  $ IndividualRate: num  321.7 424 46.1 1342.6 30.1 ...
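One R pitfall worth flagging: as.numeric() on a factor (as used for StateCode above) returns the internal level codes, not the printed labels. That is what we want for StateCode, but it would silently corrupt a factor whose labels are themselves numbers. A minimal illustration:

```r
# as.numeric() on a factor returns level codes, not label values
f <- factor(c("10", "20", "20"))
as.numeric(f)                 # 1 2 2  (internal level codes)
as.numeric(as.character(f))   # 10 20 20 (the label values themselves)
```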
# neural net
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
db_norm<-as.data.frame(lapply(data_for_boruta, normalize))

summary(db_norm)
##    StateCode         IssuerId        SourceName       VersionNum    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2632   1st Qu.:0.2185   1st Qu.:0.0000   1st Qu.:0.1304  
##  Median :0.5263   Median :0.4391   Median :0.0000   Median :0.2609  
##  Mean   :0.5282   Mean   :0.4682   Mean   :0.3093   Mean   :0.2635  
##  3rd Qu.:0.7895   3rd Qu.:0.7368   3rd Qu.:1.0000   3rd Qu.:0.3478  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   RatingAreaId        Tobacco            Age         IndividualRate    
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.:0.04545   1st Qu.:0.0000   1st Qu.:0.4769   1st Qu.:0.006143  
##  Median :0.10606   Median :0.0000   Median :0.6462   Median :0.059485  
##  Mean   :0.18100   Mean   :0.4091   Mean   :0.6484   Mean   :0.064877  
##  3rd Qu.:0.21212   3rd Qu.:1.0000   3rd Qu.:0.8308   3rd Qu.:0.095313  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000000
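The normalize() helper above divides by max(x) - min(x), which yields NaN for a constant column. A guarded variant (sketch; `normalize_safe` is an illustrative name, and mapping constant columns to 0 is our assumption about the desired behavior):

```r
# Min-max scaling that maps constant columns to 0 instead of NaN
normalize_safe <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) return(rep(0, length(x)))
  (x - min(x)) / rng
}
normalize_safe(c(2, 4, 6))  # 0.0 0.5 1.0
```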

Prepare the training and testing set for the model.

db_norm=db_norm[sample(nrow(db_norm),10000),]
sub<-sample(nrow(db_norm), floor(nrow(db_norm)*0.75))
train1<-db_norm[sub, ]
test<-db_norm[-sub, ]
final_test<-test

We train the neural network on the 2014-2016 data and then predict the new set of plan rates for 2017.

9.1 Train the model

set.seed(1234)
net_model1<-neuralnet(IndividualRate~ StateCode+ IssuerId+ SourceName+ VersionNum+ RatingAreaId+Tobacco +Age , data=train1 , hidden= c(3,2))
model1_pred<-compute(net_model1, final_test[,c(1:7)])
pred_results<-model1_pred$net.result
cor(pred_results, final_test$IndividualRate)
##              [,1]
## [1,] 0.6401599288

Our model is fairly simple, using only the predictors deemed important by Boruta. The correlation between predictions and actual rates on the validation set is just 0.64.

plot(net_model1)
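Correlation measures only linear association; an error metric on the same held-out set gives a complementary view of fit (sketch, reusing `pred_results` and `final_test` from above):

```r
# Root mean squared error on the normalized [0, 1] rate scale
rmse <- sqrt(mean((pred_results - final_test$IndividualRate)^2))
rmse
```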

9.2 Forecasting

We obtained the 2017 rate data from the CMS website and used it as a validation set for our model.

val_2017 <- read.csv("Rate_2017.csv")
colnames(val_2017)
##  [1] "BusinessYear"                             
##  [2] "StateCode"                                
##  [3] "IssuerId"                                 
##  [4] "SourceName"                               
##  [5] "ImportDate"                               
##  [6] "FederalTIN"                               
##  [7] "RateEffectiveDate"                        
##  [8] "RateExpirationDate"                       
##  [9] "PlanId"                                   
## [10] "RatingAreaId"                             
## [11] "Tobacco"                                  
## [12] "Age"                                      
## [13] "IndividualRate"                           
## [14] "IndividualTobaccoRate"                    
## [15] "Couple"                                   
## [16] "PrimarySubscriberAndOneDependent"         
## [17] "PrimarySubscriberAndTwoDependents"        
## [18] "PrimarySubscriberAndThreeOrMoreDependents"
## [19] "CoupleAndOneDependent"                    
## [20] "CoupleAndTwoDependents"                   
## [21] "CoupleAndThreeOrMoreDependents"

We do not need all the columns here. Let’s take the important ones.

cleaned_val_2017<-subset(val_2017, select=c("RatingAreaId","StateCode","IssuerId","SourceName","Tobacco","Age","IndividualRate"))
cleaned_val_2017$SourceName=as.numeric(cleaned_val_2017$SourceName)
cleaned_val_2017$RatingAreaId=as.numeric(cleaned_val_2017$RatingAreaId)
cleaned_val_2017$Tobacco=as.numeric(cleaned_val_2017$Tobacco)
cleaned_val_2017$Age=as.numeric(cleaned_val_2017$Age)
cleaned_val_2017$StateCode=as.numeric(cleaned_val_2017$StateCode)
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
cval<-as.data.frame(lapply(cleaned_val_2017[,c(1:7)], normalize))
pred_rate <- compute(net_model1, cval)$net.result
plot(pred_rate, cleaned_val_2017$IndividualRate, xlim=c(0, 12), ylim=c(0, 12)); abline(0,1, col="red", lty=2)
legend("bottomright",  c("Pred vs. Actual SQRT", "Pred=Actual Line"), cex=0.8, lty=c(1,2), lwd=c(2,2),col=c("black","red"))

We see that the predictive power of the model is poor: it is unable to predict the individual plan rates for the year 2017. Note that the 2017 file lacks the VersionNum predictor used in training, and the columns passed to compute() are not in the same order as the training covariates, which contributes to the mismatch.

10 Maximum Out of Pocket Cost (MOOP) Analysis

p_a = read.csv("PlanAttributes.csv", stringsAsFactors = FALSE)

How good a healthcare plan really is can be difficult to assess, but one very crude benchmark is the maximum out of pocket (MOOP). In layman's terms, that is the maximum a subscriber has to pay if the absolute worst happens. Here we look at the family MOOP. God forbid, say a family slid off the road during a snowstorm, several people got hurt, and they needed expensive surgery and rehab: the MOOP is the amount of money the family would have to pay before the insurance covers everything at 100%. We want to take a look at that number.

Here’s a quick glimpse and then some data cleaning.

head(p_a$TEHBInnTier1FamilyMOOP, 50)
##  [1] ""        ""        ""        ""        ""        ""        "$12,700"
##  [8] "$8,000"  "$8,000"  "$12,700" "$0"      "$9,500"  "$12,000" "$0"     
## [15] ""        ""        ""        "$9,500"  "$12,000" "$12,700" "$12,700"
## [22] "$0"      ""        ""        ""        ""        ""        "$12,700"
## [29] "$12,000" "$9,500"  "$12,000" "$12,700" "$10,400" "$2,500"  "$1,000" 
## [36] ""        "$9,500"  "$0"      "$12,700" "$10,400" "$2,500"  "$1,000" 
## [43] "$12,700" ""        ""        ""        ""        ""        ""       
## [50] ""
p_a$TEHBInnTier1FamilyMOOP<- gsub(',', '', p_a$TEHBInnTier1FamilyMOOP)
p_a$TEHBInnTier1FamilyMOOP<- gsub('\\$', '', p_a$TEHBInnTier1FamilyMOOP)
p_a$moop<- as.numeric(p_a$TEHBInnTier1FamilyMOOP)
## Warning: NAs introduced by coercion
p_a$moop[is.na(p_a$moop)] <- 0
ggplot(p_a, aes(x = moop)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are a lot of plans in there with a family MOOP of zero. That is not accurate, so we will stick to plans that actually report a dollar amount.
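The two gsub() calls above can be folded into a single reusable cleaner (sketch; `parse_dollars` is an illustrative name, not used elsewhere in this report):

```r
# One-pass dollar-string cleaner: strip "$" and "," then coerce to numeric
parse_dollars <- function(x) as.numeric(gsub("[$,]", "", x))
parse_dollars(c("$12,700", "$8,000", ""))  # 12700 8000 NA (empty strings coerce to NA with a warning)
```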

We are going to map this to see which states have the worst family MOOP on average. We use a helper function that converts state abbreviations into the full state names that the choropleth map expects.

stateFromLower <-function(x) {
   #read 52 state codes into local variable [includes DC (Washington D.C.) and PR (Puerto Rico)]
  st.codes<-data.frame(
                      state=as.factor(c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
                                         "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
                                         "MI", "MN", "MO", "MS",  "MT", "NC", "ND", "NE", "NH", "NJ", "NM",
                                         "NV", "NY", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN",
                                         "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")),
                      full=as.factor(c("alaska","alabama","arkansas","arizona","california","colorado",
                                       "connecticut","district of columbia","delaware","florida","georgia",
                                       "hawaii","iowa","idaho","illinois","indiana","kansas","kentucky",
                                       "louisiana","massachusetts","maryland","maine","michigan","minnesota",
                                       "missouri","mississippi","montana","north carolina","north dakota",
                                       "nebraska","new hampshire","new jersey","new mexico","nevada",
                                       "new york","ohio","oklahoma","oregon","pennsylvania","puerto rico",
                                       "rhode island","south carolina","south dakota","tennessee","texas",
                                       "utah","virginia","vermont","washington","wisconsin",
                                       "west virginia","wyoming"))
                       )
     #create an nx1 data.frame of state codes from source column
  st.x<-data.frame(state=x)
     #match source codes with codes from 'st.codes' local variable and use to return the full state name
  refac.x<-st.codes$full[match(st.x$state,st.codes$state)]
     #return the full state names in the same order in which they appeared in the original source
  return(refac.x)
}
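Base R also ships the state.abb and state.name vectors, which cover the 50 states (but not DC or PR, unlike stateFromLower above); a compact alternative sketch:

```r
# Named-vector lookup from abbreviation to lowercase full name
# (abbr_to_region is an illustrative name; 50 states only)
abbr_to_region <- function(x) {
  lookup <- setNames(tolower(state.name), state.abb)
  unname(lookup[x])
}
abbr_to_region(c("ID", "AZ", "NM"))  # "idaho" "arizona" "new mexico"
```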
moop <- subset(p_a, moop > 0)

# map this to see which states have the worst MOOP on average for a family;
# aggregate the filtered `moop` subset so zero-MOOP plans do not drag the averages down
df <- aggregate(moop$moop, list(moop$StateCode), mean)
df$region<-stateFromLower(df$Group.1)
df$value <- df$x
choro = StateChoropleth$new(df)
choro$title = "Average Max Out of Pocket"
choro$set_num_colors(1)
myPalette <- colorRampPalette(brewer.pal(9, "Reds"))
choro$ggplot_polygon = geom_polygon(aes(fill = value), color = NA)
choro$ggplot_scale = scale_fill_gradientn(name = "MOOP", colours = myPalette(9))
choro$render()
## Warning: Column `region` joining character vector and factor, coercing into
## character vector
## Warning in self$bind(): The following regions were missing and are being
## set to NA: minnesota, california, maryland, colorado, washington, vermont,
## kentucky, massachusetts, connecticut, new york, rhode island, district of
## columbia

Idaho is the worst, followed by Arizona and New Mexico. Things look fairly uniform throughout the rest of the country; however, we also want to look at how MOOP has changed over time. Our data covers only two years under the ACA, 2014 and 2015, and we would like to see whether MOOP has risen.

moop14 <- subset(moop, BusinessYear == "2014")
dim(moop14)
## [1] 11763   177
table(moop14$moop)
## 
##   400   500   600   700   800   900   950  1000  1100  1150  1200  1240 
##     2     1     1     2    40     3     3   178     5     8    39     1 
##  1300  1400  1500  1508  1600  1660  1700  1900  2000  2200  2300  2350 
##    29    44   115    16    12     4     5     1   266    25    14    12 
##  2400  2500  2600  2700  2800  2900  3000  3100  3200  3300  3400  3500 
##    29   126    14     6    65    39   385     2     1     4    20    35 
##  3600  3700  3750  3800  3900  4000  4200  4230  4300  4400  4500  4600 
##    13    14     1    19     1   302    30     8    12     9   533     6 
##  4700  5000  5200  5300  5350  5400  5500  5600  5800  5840  5900  6000 
##     5   212    28     7    10    12    19    13    12     3     2   485 
##  6250  6300  6400  6500  6600  6750  6800  6900  7000  7050  7200  7300 
##     2     3    21    24     2     1     4     2   409    10    26    16 
##  7400  7500  7600  7700  7800  7900  8000  8200  8250  8300  8400  8500 
##     8    22    27     1     2     4   380     5     8     3    30    23 
##  8700  8800  9000  9100  9200  9300  9400  9500  9600  9700  9750  9800 
##     3    16   303     4     4     4     6    56    13     3    42    10 
## 10000 10160 10200 10300 10338 10360 10400 10500 10600 11000 11200 11400 
##   741     4    18     2     4     4   402    83     7   181     2     2 
## 11500 11600 12000 12200 12400 12500 12600 12650 12675 12700 
##    10    31   530     4     2   332   341     3     2  4253
summary(moop14$moop)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   400.000  6000.000 10400.000  9207.298 12700.000 12700.000
moop15 <- subset(moop, BusinessYear == "2015")
dim(moop15)
## [1] 22314   177
table(moop15$moop)
## 
##   300   400   500   700   800   850   900   950  1000  1050  1100  1150 
##     1     4     1     3    51     2    17     6   469    10    43     8 
##  1200  1240  1250  1300  1350  1400  1450  1500  1508  1520  1600  1650 
##   100     1     1    46     4    64     3   263    23     2    15     2 
##  1660  1700  1750  1800  1900  2000  2100  2200  2300  2400  2500  2600 
##     2    10     2     1     1   387     1    40    21    52   143    36 
##  2700  2800  2900  2950  3000  3050  3100  3200  3250  3300  3400  3450 
##    24    82   174     1   688     1     5    23     3    10    24     1 
##  3500  3600  3650  3700  3750  3800  3850  3900  4000  4100  4150  4200 
##    62    47     1    47     9    12     2    11   691     6     4    88 
##  4250  4300  4400  4500  4600  4800  5000  5100  5200  5300  5350  5400 
##     6    12    23   668     9    15   232     5    29    20     2     9 
##  5500  5600  5800  5840  5900  6000  6150  6200  6250  6350  6400  6500 
##    98    50    25     6     4   611     3     9     4     3    16    33 
##  6600  6700  6750  6800  6900  6950  7000  7050  7100  7200  7300  7400 
##    17     6     2     7     9     2   835     6     1    50    33     9 
##  7450  7500  7600  7650  7700  7800  7850  7900  8000  8100  8150  8200 
##     4    15    53     2     1    10     2    10   593     2     1    16 
##  8250  8300  8400  8500  8600  8700  8800  8850  8900  9000  9050  9100 
##    13     7    39    99     1     5    17     2     1   451     1     6 
##  9150  9200  9300  9400  9450  9500  9600  9650  9700  9750  9800  9850 
##     1    64    12    27     1    79    31     2    42   229    25     5 
##  9900  9950 10000 10050 10100 10200 10300 10338 10360 10400 10500 10600 
##     9     3  1053     2     2    46     6     4     4   668   283    19 
## 10700 10800 10850 10950 11000 11050 11100 11150 11200 11250 11300 11400 
##     3     9     3     8   435     3     6     2    12     2   126    18 
## 11450 11500 11550 11600 11700 11750 11800 11900 12000 12050 12200 12400 
##     5    46     3    40     2     3    18     6   963     3     7    17 
## 12450 12500 12550 12600 12650 12700 12800 12900 13000 13100 13200 
##     3   304     3   558     3  4558   108   559   391     9  3390

One thing to note here: there are many more total plans in 2015, almost twice as many. That means we need to think about how to display this visually to avoid a misleading comparison.

summary(moop15$moop)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   300.000  6000.000 11175.000  9498.264 12700.000 13200.000
moop16 <- subset(moop, BusinessYear == "2016")
dim(moop16)
## [1]   0 177

Now, this is where things get very interesting. (The empty 2016 subset above confirms the MOOP data covers only 2014 and 2015.) The maximum MOOP in 2014 was $12,700, and many plans sat at exactly that value: 4,253 in total, roughly a third of all plans. In 2015 the ceiling rose to $13,200, and plans moved up with it: over 9,000 plans now have family MOOPs of $12,700 or more. As the MOOP ceiling has gone up, health insurers have moved their MOOPs up along with it. That is a worrying trend.
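The shares quoted above can be checked directly against the frequency tables (sketch, using the `moop14` and `moop15` subsets defined earlier):

```r
# Proportion of plans at or above the 2014 ceiling of $12,700
mean(moop14$moop == 12700)   # ~0.36, i.e., about a third of 2014 plans
mean(moop15$moop >= 12700)   # ~0.40 of 2015 plans at $12,700 or more
```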

Remembering that the difference between the total plan counts in 2014 and 2015 is large, we use percentages to display the information in a way that makes sense.

# for 2014
g1 <- ggplot(moop14, aes(x = moop)) +
  geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 500, col = "black", fill = "orange") +
  scale_y_continuous(labels = percent) +
  labs(title = "MOOP in 2014", x = "Family MOOP", y = "Percent of Total Plans") +
  theme(text = element_text(size = 15))

# for 2015
g2 <- ggplot(moop15, aes(x = moop)) +
  geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 500, col = "black", fill = "darkgreen") +
  scale_y_continuous(labels = percent) +
  labs(title = "MOOP in 2015", x = "Family MOOP", y = "Percent of Total Plans") +
  theme(text = element_text(size = 15))

grid.arrange(g1, g2, ncol = 2)

While the ACA has obviously been a huge benefit to families who need it, it is a little sobering that so many plans offer the poorest coverage permitted under the ACA. It will be interesting to see what happens over the next five years. Will HHS keep allowing MOOP to rise, or will they push back? If this data is any indication, health insurers will continue to raise MOOP as long as they are allowed to.

10.1 State Ideology by Family MOOP

What we are looking for is a relationship between liberal states and better health insurance through the exchanges. There are many different ways to quantify “good” health insurance, but we have used two measures:

- Maximum Out of Pocket (MOOP) costs for a family
- Monthly premiums

moop_final = subset(p_a, BusinessYear == "2014" & DentalOnlyPlan == "No")

A quick exploration shows that this dataset has dental-only plans mixed in with the health insurance plans; the DentalOnlyPlan filter in the subset above drops them.

head(moop_final$TEHBInnTier1FamilyMOOP, 50)
##  [1] "12700" "8000"  "8000"  "12700" "0"     "9500"  "12000" "0"    
##  [9] "9500"  "12000" "12700" "12700" "0"     "12700" "12000" "9500" 
## [17] "12000" "12700" "10400" "2500"  "1000"  "9500"  "0"     "12700"
## [25] "10400" "2500"  "1000"  "12700" "2500"  "12700" "1000"  "12000"
## [33] "12700" "12000" "0"     "12000" "9500"  "0"     "9500"  "12700"
## [41] "12700" "8000"  "12700" "8000"  "12700" "12700" "12700" "12700"
## [49] "0"     "8000"
# Remove '$' and comma signs, then coerce the MOOP strings to numeric

moop_final$TEHBInnTier1FamilyMOOP <- gsub('\\$', '', moop_final$TEHBInnTier1FamilyMOOP)
moop_final$TEHBInnTier1FamilyMOOP <- gsub(',', '', moop_final$TEHBInnTier1FamilyMOOP)
moop_final$MOOP <- as.numeric(moop_final$TEHBInnTier1FamilyMOOP)
## Warning: NAs introduced by coercion
# Values that fail to parse become NA; recode them as 0
moop_final$MOOP[is.na(moop_final$MOOP)] <- 0
head(moop_final$MOOP, 50)
##  [1] 12700  8000  8000 12700     0  9500 12000     0  9500 12000 12700
## [12] 12700     0 12700 12000  9500 12000 12700 10400  2500  1000  9500
## [23]     0 12700 10400  2500  1000 12700  2500 12700  1000 12000 12700
## [34] 12000     0 12000  9500     0  9500 12700 12700  8000 12700  8000
## [45] 12700 12700 12700 12700     0  8000
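
The two `gsub` calls plus the NA fix above can be folded into one small reusable cleaner. A sketch (the helper name is ours, not from any package):

```r
# Strip currency formatting and coerce to numeric; unparseable
# values become 0, matching the treatment above.
clean_money <- function(x) {
  out <- suppressWarnings(as.numeric(gsub("[$,]", "", x)))
  out[is.na(out)] <- 0
  out
}
clean_money(c("$12,700", "8000", "N/A"))  # returns 12700 8000 0
```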
counts <- table(moop_final$MOOP)
#counts
barplot(counts, main="Max Out of Pocket", xlab="Amount (in $)")

The distribution is strongly bimodal, with mass piled up at both ends of the scale. We see that the maximum value is $12,700 and that over 4,000 plans have that as their family max out of pocket. After doing some research we find that $12,700 was the maximum permitted MOOP for plans available through the ACA in 2014, which explains the pile-up. Now, we want to find the average MOOP for each state contained in the dataset.

moop_agg <- aggregate(moop_final$MOOP, list(moop_final$StateCode), mean)
names(moop_agg) <- c("State", "Moop")
moop_agg
##    State        Moop
## 1     AK 7895.238095
## 2     AL 7861.702128
## 3     AR 7756.797251
## 4     AZ 7242.530541
## 5     DE 6350.588235
## 6     FL 5706.446414
## 7     GA 7593.617021
## 8     IA 7450.909091
## 9     ID 7509.278351
## 10    IL 8240.188088
## 11    IN 7738.820302
## 12    KS 6594.807692
## 13    LA 6854.646925
## 14    ME 7305.755396
## 15    MI 6297.266515
## 16    MO 7071.164021
## 17    MS 7516.126761
## 18    MT 7775.789474
## 19    NC 7743.648208
## 20    ND 8026.666667
## 21    NE 7861.867089
## 22    NH 6695.348837
## 23    NJ 7566.513761
## 24    NM 8171.707317
## 25    OH 7099.120980
## 26    OK 7777.937337
## 27    PA 7026.809651
## 28    SC 4445.089286
## 29    SD 7007.658537
## 30    TN 6566.953317
## 31    TX 7849.168053
## 32    UT 7293.362832
## 33    VA 7813.603819
## 34    WI 6619.328165
## 35    WV 6195.161290
## 36    WY 7544.761905

10.2 Ideology

Now, we need a measure of ideology. Thankfully, Richard Fording maintains a dataset that contains a score for each state in the United States. The latest scores are for the year 2014, which is why we used only that year in our earlier subsetting. There was no convenient way to import the scores directly in R, so we created the vector by hand.

Higher values are more liberal and lower values are more conservative. South Carolina, with a score of 0, is the most conservative state.

The full ideology dataset by Richard Fording is available at https://rcfording.wordpress.com/state-ideology-data/. Hand-picking the 2014 values for the states in our data gives the following vector.

moop_agg$Ideology <- c(35.44, 19.05, 48.89, 3.02, 76.58, 11.33, 3.12, 34.38, 8.78, 83.17, 10.24, 5.38, 14.02, 67.14, 11.17, 47.6, 26.71, 43.46, 6.65, 26.91, 15.68, 66.01, 54.12, 40.63, 11.85, 8.75, 27.45, 0, 22.85, 10.68, 6.97, 6.31, 51.35, 6.10, 72.81, 5.14)
moop_agg
##    State        Moop Ideology
## 1     AK 7895.238095    35.44
## 2     AL 7861.702128    19.05
## 3     AR 7756.797251    48.89
## 4     AZ 7242.530541     3.02
## 5     DE 6350.588235    76.58
## 6     FL 5706.446414    11.33
## 7     GA 7593.617021     3.12
## 8     IA 7450.909091    34.38
## 9     ID 7509.278351     8.78
## 10    IL 8240.188088    83.17
## 11    IN 7738.820302    10.24
## 12    KS 6594.807692     5.38
## 13    LA 6854.646925    14.02
## 14    ME 7305.755396    67.14
## 15    MI 6297.266515    11.17
## 16    MO 7071.164021    47.60
## 17    MS 7516.126761    26.71
## 18    MT 7775.789474    43.46
## 19    NC 7743.648208     6.65
## 20    ND 8026.666667    26.91
## 21    NE 7861.867089    15.68
## 22    NH 6695.348837    66.01
## 23    NJ 7566.513761    54.12
## 24    NM 8171.707317    40.63
## 25    OH 7099.120980    11.85
## 26    OK 7777.937337     8.75
## 27    PA 7026.809651    27.45
## 28    SC 4445.089286     0.00
## 29    SD 7007.658537    22.85
## 30    TN 6566.953317    10.68
## 31    TX 7849.168053     6.97
## 32    UT 7293.362832     6.31
## 33    VA 7813.603819    51.35
## 34    WI 6619.328165     6.10
## 35    WV 6195.161290    72.81
## 36    WY 7544.761905     5.14
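
Note that hand-building the vector ties the scores to one particular row order of `moop_agg`, which would silently break if the aggregation order ever changed. A safer pattern is a named lookup, sketched here with a handful of the scores listed above:

```r
# Named vector keyed by state code (a subset of the Fording 2014 scores).
ideology_by_state <- c(AK = 35.44, AL = 19.05, IL = 83.17, SC = 0)

# Aligns by name rather than by position; as.character() guards against
# factor columns, which would otherwise index by integer code:
# moop_agg$Ideology <- ideology_by_state[as.character(moop_agg$State)]
ideology_by_state[c("SC", "IL")]   # returns 0 and 83.17
ideology_by_state["TX"]            # NA flags a missing state instead of a silent shift
```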
plot(moop_agg$Moop, moop_agg$Ideology)

ggplot(moop_agg, aes(x = Moop, y = Ideology, fill = State)) +
    geom_bar(stat = "identity", width = 20)
## Warning: position_stack requires non-overlapping x intervals

The custom theme below is inspired by https://github.com/apalbright/NewYorker/blob/master/scatter.R

my_theme <- function() {
  # Define colors for the chart
  palette <- brewer.pal("Greys", n = 9)
  color.background <- palette[2]
  color.grid.major <- palette[4]
  color.panel      <- palette[3]
  color.axis.text  <- palette[9]
  color.axis.title <- palette[9]
  color.title      <- palette[9]

  # Basic construction of the chart
  theme_bw(base_size = 9, base_family = "Georgia") +
    # Set the entire chart region to a light gray color
    theme(panel.background = element_rect(fill = color.panel, color = color.background)) +
    theme(plot.background = element_rect(fill = color.background, color = color.background)) +
    theme(panel.border = element_rect(color = color.background)) +
    # Format grid
    theme(panel.grid.major = element_line(color = color.grid.major, size = .25)) +
    theme(panel.grid.minor = element_blank()) +
    theme(axis.ticks = element_blank()) +
    # Format legend
    theme(legend.position = "right") +
    theme(legend.background = element_rect(fill = color.panel)) +
    theme(legend.text = element_text(size = 10, color = color.axis.title)) +
    # Format title, axis labels, and tick marks
    theme(plot.title = element_text(color = color.title, size = 20, vjust = 0.5, hjust = 0, face = "bold")) +
    theme(axis.text.x = element_text(size = 10, color = color.axis.text)) +
    theme(axis.text.y = element_text(size = 10, color = color.axis.text)) +
    theme(axis.title.x = element_text(size = 12, color = color.axis.title, vjust = -1, face = "italic")) +
    theme(axis.title.y = element_text(size = 12, color = color.axis.title, vjust = 1.8, face = "italic")) +
    # Plot margins
    theme(plot.margin = unit(c(.5, .5, .5, .5), "cm"))
}

10.3 Regression on the State Ideology scores vs MOOP for family

We overlay a regression line on the visualization to give a sense of the relationship.

ggplot(moop_agg, aes(Moop, Ideology))+
my_theme()+
geom_point(shape=1) +
geom_smooth(method=lm)+
labs(title= "", x="Max Out of Pocket for Family", y="State Ideology Scores")+
ggtitle(expression(atop(bold("Do Blue States Fare Better Under the ACA?"), atop(italic("Association between State Liberalism and Max Out of Pocket"),""))))+
geom_text(aes(label=State), vjust=-1, hjust=0.5, size=2)+theme(plot.title = element_text(size = 16, colour = "black", vjust = 0.5, hjust=0.5))

It looks like more liberal states actually have higher overall MOOPs and more conservative states have lower MOOPs. However, the relationship is not statistically significant.
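
The significance claim can be checked formally with a Pearson correlation test from base `stats`, run against the `moop_agg` frame built above (the small wrapper is ours):

```r
# p-value of the Pearson correlation between two numeric vectors.
cor_p <- function(a, b) cor.test(a, b)$p.value

# cor_p(moop_agg$Moop, moop_agg$Ideology)  # > 0.05 here, per the text above

# Illustration on toy data with a strong linear relation:
cor_p(c(1, 2, 3, 4, 5), c(2.1, 3.9, 6.2, 7.8, 10.1)) < 0.001  # TRUE
```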

We want to do the same thing for premiums. However, we need to load a new dataset.

df2 <- aggregate(rate_final_2014$IndividualRate, list(rate_final_2014$StateCode), mean)
names(df2) <- c("State", "Rate")
df2
##    State        Rate
## 1     AK 687.4846512
## 2     AL 310.4216273
## 3     AR 191.7281400
## 4     AZ 358.3430163
## 5     DE 303.5045784
## 6     FL 225.0343449
## 7     GA 236.2480380
## 8     IA 352.0917299
## 9     ID 335.5906603
## 10    IL 393.6916331
## 11    IN 470.4898775
## 12    KS 288.8863991
## 13    LA 388.6851162
## 14    ME 386.0573926
## 15    MI 286.4078646
## 16    MO 157.7844068
## 17    MS 336.7787666
## 18    MT 266.3338661
## 19    NC 343.6952749
## 20    ND 333.7062557
## 21    NE 339.7309144
## 22    NH 309.9268235
## 23    NJ 431.8352962
## 24    NM 262.8701088
## 25    OH 414.9238247
## 26    OK 369.7863552
## 27    PA 349.6579366
## 28    SC 302.8205983
## 29    SD 441.6457359
## 30    TN 323.4958361
## 31    TX 212.6574451
## 32    UT 282.5370221
## 33    VA 414.1727704
## 34    WI 474.7037436
## 35    WV 206.1047558
## 36    WY 451.4464911

Now we merge this new dataframe with the one constructed earlier.

total <- merge(moop_agg,df2,by=c("State"))
total
##    State        Moop Ideology        Rate
## 1     AK 7895.238095    35.44 687.4846512
## 2     AL 7861.702128    19.05 310.4216273
## 3     AR 7756.797251    48.89 191.7281400
## 4     AZ 7242.530541     3.02 358.3430163
## 5     DE 6350.588235    76.58 303.5045784
## 6     FL 5706.446414    11.33 225.0343449
## 7     GA 7593.617021     3.12 236.2480380
## 8     IA 7450.909091    34.38 352.0917299
## 9     ID 7509.278351     8.78 335.5906603
## 10    IL 8240.188088    83.17 393.6916331
## 11    IN 7738.820302    10.24 470.4898775
## 12    KS 6594.807692     5.38 288.8863991
## 13    LA 6854.646925    14.02 388.6851162
## 14    ME 7305.755396    67.14 386.0573926
## 15    MI 6297.266515    11.17 286.4078646
## 16    MO 7071.164021    47.60 157.7844068
## 17    MS 7516.126761    26.71 336.7787666
## 18    MT 7775.789474    43.46 266.3338661
## 19    NC 7743.648208     6.65 343.6952749
## 20    ND 8026.666667    26.91 333.7062557
## 21    NE 7861.867089    15.68 339.7309144
## 22    NH 6695.348837    66.01 309.9268235
## 23    NJ 7566.513761    54.12 431.8352962
## 24    NM 8171.707317    40.63 262.8701088
## 25    OH 7099.120980    11.85 414.9238247
## 26    OK 7777.937337     8.75 369.7863552
## 27    PA 7026.809651    27.45 349.6579366
## 28    SC 4445.089286     0.00 302.8205983
## 29    SD 7007.658537    22.85 441.6457359
## 30    TN 6566.953317    10.68 323.4958361
## 31    TX 7849.168053     6.97 212.6574451
## 32    UT 7293.362832     6.31 282.5370221
## 33    VA 7813.603819    51.35 414.1727704
## 34    WI 6619.328165     6.10 474.7037436
## 35    WV 6195.161290    72.81 206.1047558
## 36    WY 7544.761905     5.14 451.4464911
ggplot(total, aes(x=Rate, y=Ideology))+
my_theme()+
geom_point(shape=1) +
geom_smooth(method=lm)+
labs(title= "", x="Monthly Premium", y="State Ideology Scores")+
ggtitle(expression(atop(bold("Do Blue States Fare Better Under the ACA?"), atop(italic("Association between liberal scores and Health Insurance Premiums"),""))))+
geom_text(aes(label=State), vjust=-1, hjust=0.5, size=2)+
theme(plot.title = element_text(size = 16, face = "bold", colour = "black", vjust = 0.5, hjust=0.5))

Here the relationship is a negative one: more liberal states do have lower monthly health insurance premiums. However, this relationship is not statistically significant either.
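
As with MOOP, this can be verified with a simple linear model; a sketch (the helper is ours) meant to be run on the `total` frame built above:

```r
# p-value on the Ideology slope in a Rate ~ Ideology regression.
slope_p <- function(df) {
  summary(lm(Rate ~ Ideology, data = df))$coefficients["Ideology", "Pr(>|t|)"]
}
# slope_p(total)  # > 0.05 here, per the text above

# Illustration on toy data with a clear negative slope:
toy <- data.frame(Ideology = 1:8,
                  Rate = 500 - 20 * (1:8) + c(3, -2, 1, -4, 2, -1, 4, -3))
slope_p(toy) < 0.001  # TRUE
```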

11 Discussion/Conclusion

We generated some interesting and meaningful insights from preliminary visualizations and the further analysis that was carried out.

We found that the states with the most expensive insurance plans, on average, are Wyoming and Wisconsin. We also discovered a mostly linear relationship between Age and Individual Plan Rate, implying that people in higher age brackets form the primary market for revenue generation for health insurance providers/carriers.

The classification results indicated that the most important attributes in the association rules for business rules across benefit plans are the Domestic Partner as Spouse indicator and the Same Sex Partner as Spouse indicator.

Another interesting conclusion was drawn from the MOOP (Max Out of Pocket) analysis. Comparing MOOP data across years, there was a clear indication that health insurers raise MOOP every year when permitted, which is a concerning long-run trend. This analysis also gave us insight into state ideology: state liberalism is positively associated with family MOOP and negatively associated with monthly premiums, though neither relationship is statistically significant.

We also trained a neural net in order to build a predictive model for the Individual Plan Rate using data from past years. The features (most importantly, Age, Plan Validity/Duration and Tobacco Preference) for the model were carefully selected using the Boruta Package. This model can help us predict the plan rates for any future data.

12 Future Work

We would like to explore more feature selection methods such as RFE (Recursive Feature Elimination) and Stepwise Feature Selection. This may help provide us with better features to build even more accurate predictive models.

Also, in order to overcome the major challenge of the huge dataset size (which creates computational limitations) and hence perform better overall analysis, we would like to make use of SparkR as well as GPU computing. This would allow us to better handle big data and generate more accurate and relevant insights.

Also, since the dataset was incomplete, covering only 39 states, as part of future work we would like to carry out the analysis on a complete dataset.

13 Acknowledgments

In completing this project, we relied on numerous online resources provided by the University of Michigan, which deserve our greatest gratitude. We would especially like to thank Professor Ivo Dinov, course instructor for Data Science and Predictive Analytics (HS650) at the University of Michigan, Ann Arbor, who introduced us to the methodology of this work, whose passion for its underlying structures had a lasting effect on all of us, and who provided guidance throughout numerous consultations. Many people, especially our classmates and team members, made valuable comments and suggestions that inspired us to improve this work. We thank everyone who helped, directly or indirectly, to complete this project.